Gaussian Mixture Models (GMM) and EM Algorithm


Exploratory Data Analysis & Unsupervised Learning

     

Lecturer: Dr. HAS Sothea

Outline

1. Gaussian Mixture Models

  • 1.1. Motivation and Introduction
  • 1.2. Definition and Properties
  • 1.3. Parameter Estimation

2. EM Algorithm

  • 2.1. Motivation and Introduction
  • 2.2. EM Algorithm Steps
  • 2.3. Applications

0. Motivation

0. Motivation

Old Faithful dataset (\(272\) rows, \(2\) columns)

Code
import pandas as pd                 # Import pandas package
import seaborn as sns               # Package for beautiful graphs
import plotly.express as px
import matplotlib.pyplot as plt     # Graph management
# path = "https://gist.githubusercontent.com/curran/4b59d1046d9e66f2787780ad51a1cd87/raw/9ec906b78a98cf300947a37b56cfe70d01183200/data.tsv"                       # The data can be found in this link
df0 = pd.read_csv(path0 + "/faithful.csv")  # Import it into Python (path0: local folder containing faithful.csv)
df0.head(5)                        # Display the first 5 rows
eruptions waiting
0 3.600 79
1 1.800 54
2 3.333 74
3 2.283 62
4 4.533 85
Code
fig = px.scatter(df0, x='waiting', y='eruptions', size='eruptions')
fig.update_layout(width=440, height=400, title='Eruptions vs waiting time')
fig.show()

0. Motivation

Old Faithful dataset (\(272\) rows, \(2\) columns)

KMeans problems

  • KMeans tends to form convex/spherical clusters rather than wide elliptical ones.
  • It is ideal for linearly separable clusters, which is rarely the case in practice.
Code
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
km = KMeans(n_clusters=2)
scaler = StandardScaler()
X = scaler.fit_transform(df0)
km.fit(X)
df0['Group'] = [str(x+1) for x in km.labels_]
fig = px.scatter(df0, x='waiting', y='eruptions', size='eruptions', color='Group')
fig.update_layout(width=500, height=400, title='Eruptions vs waiting time')
fig.show()

0. Motivation

Old Faithful dataset (\(272\) rows, \(2\) columns)

Clustering types

  • Hard Assignment: each object belongs to exactly one cluster.
  • Soft Assignment: each object has a probability of belonging to each cluster.
  • The latter falls under the probabilistic approach to clustering.
Code
fig = px.scatter(df0, x='waiting', y='eruptions',
    size='eruptions', marginal_x='histogram', marginal_y='histogram', color='Group')
fig.update_layout(width=400, height=400, title='Eruptions vs waiting time')
fig.show()

I. Mixture Models

1. Mixture Models

1.1. Mixture Density

  • Mixture Models: a statistical tool that represents data as a combination of several simple components.
  • Mixture model form: Assume \(K\geq 1\) clusters, \[\forall \text{x}\in\mathcal{D}\subset\mathbb{R}^d: f(\text{x})=\sum_{k=1}^K\color{blue}{\pi_k}f_k(\text{x}; \color{blue}{\theta_k}),\] where,
    • \(\color{blue}{\vec{\pi}}=(\color{blue}{\pi_1},\dots,\color{blue}{\pi_K}):\) proportion of each component, \(\sum_{k=1}^K\color{blue}{\pi_k}=1\).
    • \(\color{blue}{\vec{\theta}}=(\color{blue}{\theta_1},\dots,\color{blue}{\theta_K}):\) parameters of each component.
    • \(f_k(\text{x};\color{blue}{\theta_k}):\) density of component \(k\) evaluated at any data point \(\text{x}\).
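  • As a quick numerical illustration (a minimal sketch with two hypothetical 1D Gaussian components and made-up weights), the mixture density at any point is just the weighted sum of the component densities:
Code
import numpy as np
from scipy.stats import norm

# Hypothetical 1D mixture with K = 2 Gaussian components (illustrative values only)
pi = np.array([0.35, 0.65])            # mixing proportions pi_k, summing to 1
mus, sigmas = [2.0, 4.3], [0.4, 0.5]   # component means and standard deviations

def mixture_density(x):
    """f(x) = sum_k pi_k * N(x; mu_k, sigma_k^2)."""
    return sum(p * norm.pdf(x, loc=m, scale=s) for p, m, s in zip(pi, mus, sigmas))

print(mixture_density(2.0))   # high density near the first component's mean
print(mixture_density(3.1))   # lower density between the two components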
Code
fig.update_layout(height=450)
fig.show()
Code
import numpy as np
import plotly.graph_objects as go

cluster1 = df0[df0['eruptions'] < 3]
cluster2 = df0[df0['eruptions'] >= 3]
cols = ['eruptions', 'waiting']
# 3. Calculate Parameters Manually
# We need Mean vector and Covariance matrix for each cluster
mu1, cov1 = cluster1[cols].mean().values, cluster1[cols].cov().values
mu2, cov2 = cluster2[cols].mean().values, cluster2[cols].cov().values

# 4. Define Manual Gaussian Function
def manual_gaussian_pdf(x_grid, y_grid, mu, cov):
    # Unpack grid to (N, 2) points
    points = np.column_stack([x_grid.flatten(), y_grid.flatten()])
    
    det = np.linalg.det(cov)
    inv_cov = np.linalg.inv(cov)
    norm_const = 1.0 / (2 * np.pi * np.sqrt(det))
    
    diff = points - mu
    exponent = -0.5 * np.sum((diff @ inv_cov) * diff, axis=1)
    z = norm_const * np.exp(exponent)
    return z.reshape(x_grid.shape)

# 5. Create Grid for plotting the surface
x_range = np.linspace(df0['eruptions'].min()*0.85, df0['eruptions'].max()*1.1, 100)
y_range = np.linspace(df0['waiting'].min()*0.85, df0['waiting'].max()*1.1, 100)
xx, yy = np.meshgrid(x_range, y_range)

# 6. Calculate Z (Height)
z1 = manual_gaussian_pdf(xx, yy, mu1, cov1)
z2 = manual_gaussian_pdf(xx, yy, mu2, cov2)
z_total = z1 * cluster1.shape[0]/df0.shape[0] + z2 * cluster2.shape[0]/df0.shape[0]

fig = go.Figure()
fig.add_trace(go.Surface(
    z=z_total, x=xx, y=yy,
    opacity=0.6,               # Make it slightly transparent
    colorscale='Viridis',
    showscale=False,
    name='Density'
))

# Add the Data Points (The "Scatter")
# We set z=0 to place them on the "floor"
fig.add_trace(go.Scatter3d(
    x=df0['eruptions'].values,
    y=df0['waiting'].values,
    z=np.zeros(len(df0)),
    mode='markers',
    marker=dict(
        size=5, 
        color=df0['Group'].map({'1': 'red', '2': 'blue'})),
    name='Observations'
))

# Update Layout for better viewing
fig.update_layout(
    title='Density of eruptions and waiting time',
    scene=dict(
        xaxis_title='Eruption Time',
        yaxis_title='Waiting Time',
        zaxis_title='Density',
        camera = dict(
            eye=dict(x=0.2, y=2, z=0.65)
        )
    ),
    width=400,
    height=450
)
fig.show()

1. Mixture Models

1.2. Observed Likelihood/Log-Likelihood

  • Maximum Likelihood Estimation: is the classical method to estimate \(f\) i.e., \([\color{blue}{\vec{\pi}},\color{blue}{\vec{\theta}}]\).

1. Observed Likelihood/Log-likelihood

  • The Observed Likelihood for \([\color{blue}{\vec{\pi}},\color{blue}{\vec{\theta}}]\) on the observation \(\text{x}_1,\dots,\text{x}_n\) is given by: \[L_{\text{obs}}(\color{blue}{\vec{\pi}},\color{blue}{\vec{\theta}})=p(X|\color{blue}{\vec{\pi}},\color{blue}{\vec{\theta}})=\prod_{i=1}^nf(\text{x}_i)=\prod_{i=1}^n\sum_{k=1}^K\color{blue}{\pi_k}f_k(\text{x}_i; \color{blue}{\theta_k}).\]
  • The Observed log-likelihood is given by: \[\mathcal{L}_{\text{obs}}(\color{blue}{\vec{\pi}},\color{blue}{\vec{\theta}})=\ln[p(X|\color{blue}{\vec{\pi}},\color{blue}{\vec{\theta}})]=\sum_{i=1}^n\ln\left[\sum_{k=1}^K\color{blue}{\pi_k}f_k(\text{x}_i; \color{blue}{\theta_k})\right]\ \ \color{red}{(1)}.\]
  • \(\color{red}{(1)}\) is intractable to maximize directly in general: the logarithm of a sum does not decouple across components!
Code
fig.show()
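  • To make \(\color{red}{(1)}\) concrete, here is a minimal sketch that evaluates the observed log-likelihood on the Old Faithful data, reusing the hand-split components (mu1, cov1), (mu2, cov2) and their empirical weights from the earlier code:
Code
import numpy as np
from scipy.stats import multivariate_normal

X_obs = df0[cols].values
pi_1 = len(cluster1) / len(df0)          # empirical weight of the first component
pi_2 = 1.0 - pi_1

# f(x_i) = pi_1 * N(x_i; mu1, cov1) + pi_2 * N(x_i; mu2, cov2)
f = (pi_1 * multivariate_normal(mu1, cov1).pdf(X_obs)
     + pi_2 * multivariate_normal(mu2, cov2).pdf(X_obs))
log_lik_obs = np.log(f).sum()            # L_obs = sum_i ln f(x_i)
print(f"Observed log-likelihood: {log_lik_obs:.2f}")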

1. Mixture Models

1.3. Latent Variables/Complete Log-Likelihood

  • Latent Variable (unobserved): \(\forall\text{x}_i\in\mathcal{D}\subset\mathbb{R}^d\), let \(\color{red}{\text{z}_i}=(\color{red}{z_{i1}},\dots,\color{red}{z_{iK}})\) such that \(\color{red}{\text{z}}_i|\text{x}_i\sim \mathcal{M}(\color{red}{\vec{\gamma_i}})\) i.e. \(\color{red}{z_{ik}}\in\{0,1\}\) & \(\sum_{k=1}^K\color{red}{z_{ik}}=1, \forall i=1,\dots,n\).
  • Key idea: \(\color{red}{z_{ik}}=1\Leftrightarrow \text{x}_i\) belongs to cluster \(k\).
  • It’s not observed but crucial for optimizing \(\color{red}{(1)}\).

2. Complete Likelihood/Log-likelihood

  • Complete Likelihood/Log-Likelihood for \([\color{blue}{\Theta}, \color{red}{\Gamma}]=[(\color{blue}{\vec{\pi}}, \color{blue}{\vec{\theta}}),(\color{red}{\vec{\gamma_i}})_{i=1}^{n}]\) on the observations \((\text{x}_i)\) and latents \((\color{red}{\text{z}_i})\) is given by: \[\begin{align*}L_{\text{com}}(\color{blue}{\Theta}, \color{red}{\Gamma})&=p(X,\color{red}{Z}|\color{blue}{\Theta}, \color{red}{\Gamma})=\prod_{i=1}^n\prod_{k=1}^K\left[\color{blue}{\pi_k}f_k(\text{x}_i; \color{blue}{\theta_k})\right]^{\color{red}{z_{ik}}}.\\ \mathcal{L}_{\text{com}}(\color{blue}{\Theta}, \color{red}{\Gamma})&=\ln[p(X,\color{red}{Z}|\color{blue}{\Theta},\color{red}{\Gamma})]=\sum_{i=1}^n\sum_{k=1}^K\color{red}{z_{ik}}\left[\ln(\color{blue}{\pi_k})+\ln\left(f_k(\text{x}_i; \color{blue}{\theta_k})\right)\right]\ \ \color{red}{(2)}.\end{align*}\]

1. Mixture Models

1.3. Latent Variables/Complete Log-Likelihood

2. Complete Likelihood/Log-likelihood

  • Complete Likelihood/Log-Likelihood for \([\color{blue}{\Theta}, \color{red}{\Gamma}]=[(\color{blue}{\vec{\pi}}, \color{blue}{\vec{\theta}}),(\color{red}{\vec{\gamma_i}})_{i=1}^{n}]\) on the observations \((\text{x}_i)\) and latents \((\color{red}{\text{z}_i})\) is given by: \[\begin{align*}L_{\text{com}}(\color{blue}{\Theta}, \color{red}{\Gamma})&=p(X,\color{red}{Z}|\color{blue}{\Theta}, \color{red}{\Gamma})=\prod_{i=1}^n\prod_{k=1}^K\left[\color{blue}{\pi_k}f_k(\text{x}_i; \color{blue}{\theta_k})\right]^{\color{red}{z_{ik}}}.\\ \mathcal{L}_{\text{com}}(\color{blue}{\Theta}, \color{red}{\Gamma})&=\ln[p(X,\color{red}{Z}|\color{blue}{\Theta},\color{red}{\Gamma})]=\sum_{i=1}^n\sum_{k=1}^K\color{red}{z_{ik}}\left[\ln(\color{blue}{\pi_k})+\ln\left(f_k(\text{x}_i; \color{blue}{\theta_k})\right)\right]\ \ \color{red}{(2)}.\end{align*}\]
  • Q1: What do we want?
  • A1: Maximize \(\mathcal{L}_{\text{obs}}(\color{blue}{\Theta})=\ln[p(X|\color{blue}{\Theta})]\) for \(\color{blue}{\Theta}\).
  • Observed and Complete Likelihood are related by: \[L_{\text{obs}}(X|\color{blue}{\Theta})=\mathbb{E}_{\color{red}{Z}}[L_{\text{com}}(X,\color{red}{Z}|\color{blue}{\Theta}, \color{red}{\Gamma})].\]
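  • As a minimal sketch of \(\color{red}{(2)}\), we can plug in hard 0/1 latent indicators taken from the eruptions < 3 split used earlier, together with the hand-split components (mu1, cov1), (mu2, cov2):
Code
import numpy as np
from scipy.stats import multivariate_normal

X_obs = df0[cols].values
z_1 = (df0['eruptions'] < 3).values.astype(float)    # z_i1 in {0, 1}
z_2 = 1.0 - z_1                                      # z_i2 = 1 - z_i1
pi_1, pi_2 = z_1.mean(), z_2.mean()

log_f1 = multivariate_normal(mu1, cov1).logpdf(X_obs)
log_f2 = multivariate_normal(mu2, cov2).logpdf(X_obs)

# L_com = sum_i sum_k z_ik [ln(pi_k) + ln f_k(x_i)]
log_lik_com = np.sum(z_1 * (np.log(pi_1) + log_f1) + z_2 * (np.log(pi_2) + log_f2))
print(f"Complete log-likelihood: {log_lik_com:.2f}")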

1. Mixture Models

1.4. Evidence Lower Bound (ELBO)

  • Observed and Complete Likelihood are related by: \[L_{\text{obs}}(X|\color{blue}{\Theta})=\mathbb{E}_{\color{red}{Z}}[L_{\text{com}}(X,\color{red}{Z}|\color{blue}{\Theta}, \color{red}{\Gamma})].\]
  • The observed log-likelihood (OLL) \(\mathcal{L}_{\text{obs}}\), also known as the Evidence, is related to the complete log-likelihood (CLL) \(\mathcal{L}_{\text{com}}\) by: \[\begin{align*} \mathcal{L}_{\text{obs}}(\color{blue}{\Theta})&=\mathcal{L}_{\text{com}}(\color{blue}{\Theta}, \color{red}{\Gamma})-\ln\left[p(\color{red}{Z}|X,\color{red}{\Gamma})\right].\end{align*}\]
  • For any distribution \(q(\color{red}{Z})\), by taking expectation w.r.t \(\color{red}{Z}\) on both sides, \[\begin{align*} \mathcal{L}_{\text{obs}}(\color{blue}{\Theta})&=\underbrace{\mathbb{E}_{q(\color{red}{Z})}[\mathcal{L}_{\text{com}}(\color{blue}{\Theta}, \color{red}{\Gamma})-\ln[q(\color{red}{Z})]]}_{\color{blue}{\text{ELBO: }}\text{denoted by }\color{blue}{Q}(\color{blue}{\Theta}, \color{red}{\Gamma})}+\underbrace{\mathbb{E}_{q(\color{red}{Z})}\left[\ln\left[\frac{q(\color{red}{Z})}{p(\color{red}{Z}|X,\color{red}{\Gamma})} \right]\right]}_{\color{red}{\text{KL}}[q(\color{red}{Z})||p(\color{red}{Z}|X,\color{red}{\Gamma})]\geq 0}\ \ \color{red}{(3)}.\end{align*}\]
  • Key breakthrough: In \(\color{red}{(3)}\), if we take \(q(\color{red}{Z})=p(\color{red}{Z}|X,\color{red}{\Gamma})\), then maximizing the ELBO \(\color{blue}{Q}(\color{blue}{\Theta}, \color{red}{\Gamma})\) w.r.t. \(\color{blue}{\Theta}\) also maximizes \(\mathcal{L}_{\text{obs}}(\color{blue}{\Theta})\).
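  • The decomposition \(\color{red}{(3)}\) can be checked numerically. A minimal sketch for a single observation of a hypothetical two-component 1D mixture: for any distribution \(q(\color{red}{Z})\), ELBO + KL recovers the evidence \(\ln f(x)\), and the KL term vanishes when \(q\) equals the posterior:
Code
import numpy as np
from scipy.stats import norm

# Hypothetical 1D mixture (illustrative values) and one observation x
pi = np.array([0.4, 0.6])
mus, sigmas = np.array([2.0, 4.5]), np.array([0.5, 0.8])
x = 3.0

comp = pi * norm.pdf(x, mus, sigmas)     # pi_k * f_k(x)
evidence = np.log(comp.sum())            # ln f(x)
posterior = comp / comp.sum()            # p(z_k = 1 | x): the responsibilities

q = np.array([0.7, 0.3])                                  # any distribution q(z)
elbo = np.sum(q * (np.log(comp) - np.log(q)))             # E_q[ln(pi_k f_k(x)) - ln q(z)]
kl = np.sum(q * (np.log(q) - np.log(posterior)))          # KL(q || p(z|x)) >= 0

print(np.isclose(elbo + kl, evidence))                    # True: ELBO + KL = evidence
elbo_post = np.sum(posterior * (np.log(comp) - np.log(posterior)))
print(np.isclose(elbo_post, evidence))                    # True: q = posterior => ELBO = evidence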

1. Mixture Models

1.4. Evidence Lower Bound (ELBO): Summary

  • Given data \(D=\{\text{x}_1,\dots,\text{x}_n\}\subset\mathbb{R}^d\) and number of components \(K\geq 1\), we model the density of the data by \[f(\text{x})=\sum_{k=1}^K\color{blue}{\pi_k}f_k(\text{x}; \color{blue}{\theta_k}).\]
  • Latent variable \(\color{red}{\text{z}_i}=(\color{red}{z_{i1}},\dots,\color{red}{z_{iK}}):\) unobserved group membership of \(\text{x}_i\).
  • Given \(\text{x}_i\), it may belong to component \(k\) with probability \(\color{red}{\gamma_{ik}}\), called the ‘responsibility’. One has \[\color{red}{\Gamma}=\begin{pmatrix}\color{red}{\gamma_{11}}&\dots&\color{red}{\gamma_{1K}}\\ \vdots&\ddots&\vdots\\ \color{red}{\gamma_{n1}}&\dots&\color{red}{\gamma_{nK}} \end{pmatrix}\in [0,1]^{n\times K}\ (\text{clustering signature}).\]
  • The best \(\color{blue}{\Theta}=(\color{blue}{\vec{\pi}},\color{blue}{\vec{\theta}})\) is estimated by iteratively maximizing ELBO: \(\color{blue}{Q}(\color{blue}{\Theta}, \color{red}{\Gamma})\).

II. Gaussian Mixture Model and EM Algorithm

1. Gaussian Mixture Models (GMM)

1.1. GMM Definition

  • The components \(f_k\) of a Mixture Model can be modeled using any distribution.
  • The most common choice is \(f_k(\text{x})=\mathcal{N}(\text{x};\color{blue}{\vec{\mu}_k},\color{blue}{\Sigma_k})\) where
    • \(\color{blue}{\vec{\mu}_k}\in\mathbb{R}^d\) is the mean vector
    • \(\color{blue}{\Sigma_k}\in\mathbb{R}^{d\times d}\) is the covariance matrix of component \(k\) of the model.
Code
import numpy as np
import plotly.graph_objs as go
mu1, Sigma1 = [0, 0], [[1, 0], [0, 3]]

mu2, Sigma2  = [5, 5], [[2, -1], [-1, 2]]

mu3, Sigma3 = [-1, 6], [[2, 1.2], [1.2, 1]]

mu4, Sigma4 = [6, 0], [[3, 0.1], [0.1, 0.25]]
x1 = np.random.multivariate_normal(mean=mu1, cov=Sigma1, size=100)
x2 = np.random.multivariate_normal(mean=mu2, cov=Sigma2, size=100)
x3 = np.random.multivariate_normal(mean=mu3, cov=Sigma3, size=100)
x4 = np.random.multivariate_normal(mean=mu4, cov=Sigma4, size=100)
fig = go.Figure(
    go.Scatter(x=x1[:,0], y = x1[:,1],
               mode = "markers",
               name = r"$\color{blue}{\mu_1=\begin{pmatrix}0\\ 0\end{pmatrix}, \Sigma_1=\begin{pmatrix}1&0\\0&3\end{pmatrix}}$",
               showlegend = True,
               marker = dict(size = 10)))
fig.add_trace(
    go.Scatter(x=x2[:,0], y = x2[:,1],
               mode = "markers",
               name = r"$\color{red}{\mu_1=\begin{pmatrix}5\\ 5\end{pmatrix},\Sigma_2=\begin{pmatrix}2&-1\\-1&2\end{pmatrix}}$",
               showlegend = True,
               marker = dict(size = 10)))
fig.add_trace(
    go.Scatter(x=x3[:,0], y = x3[:,1],
               mode = "markers",
               name = r"$\color{green}{\mu_1=\begin{pmatrix}-1\\ 6\end{pmatrix}, \Sigma_3=\begin{pmatrix}2&1.2\\1.2&1\end{pmatrix}}$",
               showlegend = True,
               marker = dict(size = 10)))
fig.add_trace(
    go.Scatter(x=x4[:,0], y = x4[:,1],
               mode = "markers",
               name = r"$\color{purple}{\mu_1=\begin{pmatrix}6\\ 0\end{pmatrix},\Sigma_4=\begin{pmatrix}3&0.1\\0.1&0.25\end{pmatrix}}$",
               showlegend = True,
               marker = dict(size = 10)))

fig.update_layout(title = "2D Gaussian Distribution",
                  width = 500,
                  height = 280)
fig.show()
Code
import numpy as np
import plotly.graph_objects as go
from scipy.stats import multivariate_normal

# 3. Create a Grid for Density Calculation
# We calculate the bounds to ensure our grid covers all data
all_x = np.concatenate([x1[:,0], x2[:,0], x3[:,0], x4[:,0]])
all_y = np.concatenate([x1[:,1], x2[:,1], x3[:,1], x4[:,1]])
x_range = np.linspace(all_x.min()-1, all_x.max()+1, 100)
y_range = np.linspace(all_y.min()-1, all_y.max()+1, 100)
x_grid, y_grid = np.meshgrid(x_range, y_range)
pos = np.dstack((x_grid, y_grid))

# 4. Calculate PDFs (Z-axis height)
z1 = multivariate_normal(mu1, Sigma1).pdf(pos)
z2 = multivariate_normal(mu2, Sigma2).pdf(pos)
z3 = multivariate_normal(mu3, Sigma3).pdf(pos)
z4 = multivariate_normal(mu4, Sigma4).pdf(pos)

# 5. Build the 3D Figure
fig = go.Figure()

# --- Add Scatter Points on the floor (z=0) ---
# We set z=0 for all points to place them on the 'floor'
for data, color, name in [(x1, 'blue', 'Cluster 1'), 
                          (x2, 'red', 'Cluster 2'), 
                          (x3, 'green', 'Cluster 3'), 
                          (x4, 'purple', 'Cluster 4')]:
    fig.add_trace(go.Scatter3d(
        x=data[:,0], y=data[:,1], z=np.zeros(len(data)),
        mode='markers',
        marker=dict(size=6, color=color, opacity=0.8),
        name=name + ' (Data)'
    ))

# --- Add Density Surfaces ---
# We use separate surfaces for each cluster with matching colors
# 'showscale=False' hides the colorbar to keep it clean
# colorscale=[[0, color], [1, color]] forces a solid color surface
fig.add_trace(go.Surface(x=x_grid, y=y_grid, z=z1, opacity=0.3, showscale=False, 
                         colorscale=[[0, 'blue'], [1, 'blue']], name='Density 1'))
fig.add_trace(go.Surface(x=x_grid, y=y_grid, z=z2, opacity=0.3, showscale=False, 
                         colorscale=[[0, 'red'], [1, 'red']], name='Density 2'))
fig.add_trace(go.Surface(x=x_grid, y=y_grid, z=z3, opacity=0.3, showscale=False, 
                         colorscale=[[0, 'green'], [1, 'green']], name='Density 3'))
fig.add_trace(go.Surface(x=x_grid, y=y_grid, z=z4, opacity=0.3, showscale=False, 
                         colorscale=[[0, 'purple'], [1, 'purple']], name='Density 4'))

# 6. Layout adjustments
fig.update_layout(
    title="3D Gaussian Mixture Densities",
    width=500, height=280,
    scene=dict(
        xaxis_title='X',
        yaxis_title='Y',
        zaxis_title='Probability Density',
        aspectmode='cube' # Keeps x, y, z scales proportional
    ),
    margin=dict(l=0, r=0, b=0, t=50) # Tight layout
)

fig.show()

1. Gaussian Mixture Models (GMM)

  • GMM form: \(f(\text{x})=\sum_{k=1}^K\frac{\color{blue}{\pi_k}}{(2\pi)^{d/2}|\color{blue}{\Sigma_k}|^{1/2}}\exp\left[-\frac{1}{2}(\text{x}-\color{blue}{\vec{\mu_k}})^T\color{blue}{\Sigma_k}^{-1}(\text{x}-\color{blue}{\vec{\mu_k}})\right]\).
  • The following notation are used:
    • Model parameters: \(\color{blue}{\Theta}=(\color{blue}{\vec{\pi}},(\color{blue}{\vec{\mu_k}})_k,(\color{blue}{\Sigma}_k)_k)\): all parameters of the model.
    • Responsibility: \(\color{red}{\Gamma}=(\color{red}{\gamma_{ik}})_{i,k}\): the probability that observation \(i\) belongs to component \(k\).
  • The true posterior can be computed using Bayes’ rule: \[p(\color{red}{z_{ik}}=1|\text{x}_i, \color{red}{\vec{\gamma}_{i}})=\frac{p(\text{x}_i,\color{red}{z_{ik}}=1|\color{red}{\vec{\gamma}_{i}},\color{blue}{\Theta})}{p(\text{x}_i|\color{blue}{\Theta})}=\frac{\color{blue}{\pi_k}\mathcal{N}(\text{x}_i; \color{blue}{\vec{\mu_k}}, \color{blue}{\Sigma_k})}{\sum_{j=1}^K\color{blue}{\pi_j}\mathcal{N}(\text{x}_i; \color{blue}{\mu_j}, \color{blue}{\Sigma_j})}\]
  • ELBO: \[\color{blue}{Q}(\color{blue}{\Theta}, \color{red}{\Gamma}) = \sum_{i=1}^{n} \sum_{k=1}^{K} \color{red}{\gamma_{ik}} \left[ \ln(\color{blue}{\pi_k}) + \ln[\mathcal{N}(\text{x}_i;\color{blue}{\vec{\mu_k}}, \color{blue}{\Sigma_k})] \right]-\mathbb{E}_{q(\color{red}{Z})}[\ln q(\color{red}{Z})].\]
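  • A minimal sketch of this Bayes' rule computation for a single Old Faithful observation (recomputing the hand-split component parameters from cluster1 and cluster2 defined earlier):
Code
import numpy as np
from scipy.stats import multivariate_normal

# Hand-split component parameters and empirical mixing weights
mu_a, cov_a = cluster1[cols].mean().values, cluster1[cols].cov().values
mu_b, cov_b = cluster2[cols].mean().values, cluster2[cols].cov().values
pis = np.array([len(cluster1), len(cluster2)]) / len(df0)

x_i = df0[cols].values[0]                # one observation (eruptions, waiting)
num = np.array([pis[0] * multivariate_normal(mu_a, cov_a).pdf(x_i),
                pis[1] * multivariate_normal(mu_b, cov_b).pdf(x_i)])
gamma_i = num / num.sum()                # responsibilities gamma_i1, gamma_i2, summing to 1
print(gamma_i.round(3))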

2. EM Algorithm for GMM

  • 💡 Key: For any choice of \(\color{red}{\Gamma}\), one has: \(\max_{\color{blue}{\Theta}}\color{blue}{Q}(\color{blue}{\Theta}, \color{red}{\Gamma})\leq\max_{\color{blue}{\Theta}}\mathcal{L}_{\text{obs}}(\color{blue}{\Theta})\).
  • EM Algorithm: An iterative method for finding the MLE of the parameters \(\color{blue}{\Theta}\) of a model with latent variables \(\color{red}{\Gamma}\), alternating between two steps until convergence. For \(t=0,1,...,T\):
    • E-step: Compute the responsibilities \(\color{red}{\gamma_{ik}}^{(t+1)}\) from current parameters \(\color{blue}{\Theta}^{(t)}\): \[\color{red}{\gamma_{ik}}^{(t+1)}=p(\color{red}{z_{ik}}=1|\text{x}_i, \color{red}{\vec{\gamma}_{i}}^{(t)},\color{blue}{\Theta}^{(t)})=\frac{\color{blue}{\pi_k}^{(t)}\mathcal{N}(\text{x}_i; \color{blue}{\vec{\mu_k}}^{(t)}, \color{blue}{\Sigma_k}^{(t)})}{\sum_{j=1}^K\color{blue}{\pi_j}^{(t)}\mathcal{N}(\text{x}_i; \color{blue}{\mu_j}^{(t)}, \color{blue}{\Sigma_j}^{(t)})}\text{ then }\color{blue}{Q}(\color{blue}{\Theta}, \color{red}{\Gamma}^{(t+1)}).\]
    • M-step: Update the parameters \(\color{blue}{\Theta}^{(t)}\to\color{blue}{\Theta}^{(t+1)}=\arg\max_{\color{blue}{\Theta}}\color{blue}{Q}(\color{blue}{\Theta}, \color{red}{\Gamma}^{(t+1)})\): \[\begin{align*} \color{blue}{\pi_k}^{(t+1)}&=\frac{1}{n}\sum_{i=1}^n\color{red}{\gamma_{ik}}^{(t+1)},\\ \color{blue}{\vec{\mu_k}}^{(t+1)}&=\frac{\sum_{i=1}^n\color{red}{\gamma_{ik}}^{(t+1)}\text{x}_i}{\sum_{i=1}^n\color{red}{\gamma_{ik}}^{(t+1)}},\\ \color{blue}{\Sigma_k}^{(t+1)}&=\frac{\sum_{i=1}^n\color{red}{\gamma_{ik}}^{(t+1)}(\text{x}_i-\color{blue}{\vec{\mu_k}}^{(t+1)})(\text{x}_i-\color{blue}{\vec{\mu_k}}^{(t+1)})^T}{\sum_{i=1}^n\color{red}{\gamma_{ik}}^{(t+1)}}.\end{align*}\]

2. EM Algorithm for GMM

  • E-step: Compute the responsibilities \(\color{red}{\gamma_{ik}}^{(t+1)}\) from current parameters \(\color{blue}{\Theta}^{(t)}\): \[\color{red}{\gamma_{ik}}^{(t+1)}=p(\color{red}{z_{ik}}=1|\text{x}_i, \color{red}{\vec{\gamma}_{i}}^{(t)},\color{blue}{\Theta}^{(t)})=\frac{\color{blue}{\pi_k}^{(t)}\mathcal{N}(\text{x}_i; \color{blue}{\vec{\mu_k}}^{(t)}, \color{blue}{\Sigma_k}^{(t)})}{\sum_{j=1}^K\color{blue}{\pi_j}^{(t)}\mathcal{N}(\text{x}_i; \color{blue}{\mu_j}^{(t)}, \color{blue}{\Sigma_j}^{(t)})}\text{ then }\color{blue}{Q}(\color{blue}{\Theta}, \color{red}{\Gamma}^{(t+1)}).\]
  • M-step: Update the parameters \(\color{blue}{\Theta}^{(t)}\to\color{blue}{\Theta}^{(t+1)}=\arg\max_{\color{blue}{\Theta}}\color{blue}{Q}(\color{blue}{\Theta}, \color{red}{\Gamma}^{(t+1)})\): \[\begin{align*} \color{blue}{\pi_k}^{(t+1)}&=\frac{1}{n}\sum_{i=1}^n\color{red}{\gamma_{ik}}^{(t+1)},\\ \color{blue}{\vec{\mu_k}}^{(t+1)}&=\frac{\sum_{i=1}^n\color{red}{\gamma_{ik}}^{(t+1)}\text{x}_i}{\sum_{i=1}^n\color{red}{\gamma_{ik}}^{(t+1)}},\\ \color{blue}{\Sigma_k}^{(t+1)}&=\frac{\sum_{i=1}^n\color{red}{\gamma_{ik}}^{(t+1)}(\text{x}_i-\color{blue}{\vec{\mu_k}}^{(t+1)})(\text{x}_i-\color{blue}{\vec{\mu_k}}^{(t+1)})^T}{\sum_{i=1}^n\color{red}{\gamma_{ik}}^{(t+1)}}.\end{align*}\]

\[\text{Process: }\color{blue}{\Theta}^\color{black}{(0)}\underbrace{\longrightarrow}_{\text{E-step}}\color{red}{\Gamma}^{(1)}\underbrace{\longrightarrow}_{\text{M-step}}\color{blue}{\Theta}^\color{black}{(1)}\underbrace{\longrightarrow}_{\text{E-step}}\color{red}{\Gamma}^{(2)}\underbrace{\longrightarrow}_{\text{M-step}}\color{blue}{\Theta}^\color{black}{(2)}\dots\underbrace{\longrightarrow}_{\text{E-step}}\color{red}{\Gamma}^{(t)}\underbrace{\longrightarrow}_{\text{M-step}}\color{blue}{\Theta}^\color{black}{(t)}...\]
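A compact NumPy sketch of the two updates above (the animated demo on the next slide implements the same steps; here the data matrix X and initial parameters pi, mu, cov are assumed to be set up beforehand):
Code
import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, pi, mu, cov):
    """Responsibilities gamma_ik = pi_k N(x_i; mu_k, Sigma_k) / sum_j pi_j N(x_i; mu_j, Sigma_j)."""
    probs = np.column_stack([pi[k] * multivariate_normal(mu[k], cov[k]).pdf(X)
                             for k in range(len(pi))])
    return probs / probs.sum(axis=1, keepdims=True)

def m_step(X, gamma):
    """Weighted MLE updates for pi_k, mu_k and Sigma_k."""
    n, K = gamma.shape
    N_k = gamma.sum(axis=0)                          # effective cluster sizes
    pi = N_k / n
    mu = (gamma.T @ X) / N_k[:, None]
    cov = []
    for k in range(K):
        diff = X - mu[k]
        cov.append((gamma[:, k, None] * diff).T @ diff / N_k[k])
    return pi, mu, cov

# Iterating gamma = e_step(X, pi, mu, cov) then pi, mu, cov = m_step(X, gamma)
# reproduces the Theta -> Gamma -> Theta chain above.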

2. EM Algorithm for GMM

Application on Old Faithful Dataset

Code
from scipy.stats import multivariate_normal

# Data Preparation
df0['color'] = np.where(df0['eruptions'] < 3, 'red', 'blue')
X = df0[['eruptions', 'waiting']].values

# Random Initialization
np.random.seed(42)
K = 2
n_samples, n_features = X.shape
random_indices = np.random.choice(n_samples, K, replace=False)
mu = X[random_indices].copy()
global_cov = np.cov(X.T)
cov = [global_cov.copy() for _ in range(K)]
pi = np.array([0.5, 0.5])

# Visualization Grid
x_range = np.linspace(X[:, 0].min() - 0.5, X[:, 0].max() + 0.5, 60)
y_range = np.linspace(X[:, 1].min() - 5, X[:, 1].max() + 5, 60)
xx, yy = np.meshgrid(x_range, y_range)
pos = np.dstack((xx, yy))

def get_mixture_density(pi, mu, cov, grid_pos):
    z = np.zeros(grid_pos.shape[:2])
    for k in range(K):
        z += pi[k] * multivariate_normal(mu[k], cov[k]).pdf(grid_pos)
    return z

# EM Algorithm with Frame Capture
iterations = 15
frames = []

for i in range(iterations):
    # E-STEP
    probs = np.zeros((n_samples, K))
    for k in range(K):
        probs[:, k] = pi[k] * multivariate_normal(mu[k], cov[k]).pdf(X)
    resp = probs / (probs.sum(axis=1, keepdims=True) + 1e-10)
    
    # Text for parameters
    param_text = (
        f"<b>Iteration {i}</b><br>"
        f"π: [{pi[0]:.2f}, {pi[1]:.2f}]<br>"
        f"μ1: [{mu[0][0]:.2f}, {mu[0][1]:.2f}]<br>"
        f"μ2: [{mu[1][0]:.2f}, {mu[1][1]:.2f}]"
    )
    
    z_total = get_mixture_density(pi, mu, cov, pos)
    
    # Store state: Surface + Observations + Centroids
    frames.append(go.Frame(
        data=[
            go.Surface(z=z_total, x=xx, y=yy), # Trace 0
            go.Scatter3d(x=df0['eruptions'], y=df0['waiting'], z=np.zeros(n_samples)), # Trace 1 (Static)
            go.Scatter3d(                     # Trace 2 (Moving Centroids)
                x=mu[:, 0], 
                y=mu[:, 1], 
                z=[0, 0], 
                mode='markers', 
                marker=dict(size=10, color='black', symbol='diamond', line=dict(width=2, color='white')),
                name='Centroids'
            )
        ],
        name=f"Iteration {i}",
        layout=go.Layout(annotations=[dict(
            text=param_text, align='left', showarrow=False,
            x=0.05, y=0.95, xref='paper', yref='paper',
            bgcolor="rgba(255, 255, 255, 0.7)", bordercolor="black", borderwidth=1
        )])
    ))
    
    # M-STEP
    N_k = resp.sum(axis=0)
    for k in range(K):
        mu[k] = (resp[:, k, np.newaxis] * X).sum(axis=0) / N_k[k]
        diff = X - mu[k]
        weighted_diff = resp[:, k, np.newaxis, np.newaxis] * np.einsum('ni,nj->nij', diff, diff)
        cov[k] = weighted_diff.sum(axis=0) / N_k[k]
        pi[k] = N_k[k] / n_samples

# Build Figure
fig = go.Figure(
    data=[
        go.Surface(
            z=frames[0].data[0].z, x=xx, y=yy, 
            opacity=0.6, colorscale='Viridis', showscale=False, name='Density'
        ),
        go.Scatter3d(
            x=df0['eruptions'].values, y=df0['waiting'].values, z=np.zeros(n_samples),
            mode='markers', marker=dict(size=4, color=df0['color'].values, opacity=0.8),
            name='Observations'
        ),
        go.Scatter3d(
            x=frames[0].data[2].x, y=frames[0].data[2].y, z=[0, 0],
            mode='markers', marker=dict(size=8, color='black', symbol='diamond', line=dict(width=2, color='white')),
            name='Centroids'
        )
    ],
    layout=go.Layout(
        width=400, height=470,
        title="EM Algorithm for GMM on Old Faithful Dataset",
        scene=dict(
            xaxis_title='Eruptions', yaxis_title='Waiting', zaxis_title='Density',
            camera=dict(eye=dict(x=0, y=1.5, z=0.6), center=dict(x=0, y=0, z=-0.2))
        ),
        annotations=frames[0].layout.annotations,
        updatemenus=[{
            "buttons": [
                {"args": [None, {"frame": {"duration": 400, "redraw": True}, "fromcurrent": True}],
                 "label": "Play", "method": "animate"},
                {"args": [[None], {"frame": {"duration": 0, "redraw": True}, "mode": "immediate"}],
                 "label": "Pause", "method": "animate"}
            ],
            "type": "buttons", "showactive": False, "x": -0.05, "y": 0.1, "xanchor": "left", "yanchor": "top"
        }],
        sliders=[{
            "steps": [
                {"args": [[f.name], {"frame": {"duration": 0, "redraw": True}, "mode": "immediate"}],
                 "label": str(k), "method": "animate"} for k, f in enumerate(frames)
            ],
            "active": 0, "currentvalue": {"prefix": "Iteration: "}, "pad": {"t": 50}
        }]
    ),
    frames=frames
)

fig.show()
  • Q\(^*\): Why does it work?
  • A\(^*\): Because each step increases the ELBO (see the numerical check after this list):
    • E-step: \[\color{blue}{Q}(\color{blue}{\Theta}^{(t)}, \color{red}{\Gamma}^{(t)})\leq \color{blue}{Q}(\color{blue}{\Theta}^{(t)}, \color{red}{\Gamma}^{(t+1)})=\mathcal{L}_{\text{obs}}(\color{blue}{\Theta}^{(t)}).\]
    • M-step: \[\mathcal{L}_{\text{obs}}(\color{blue}{\Theta}^{(t)})=\color{blue}{Q}(\color{blue}{\Theta}^{(t)}, \color{red}{\Gamma}^{(t+1)})\leq \color{blue}{Q}(\color{blue}{\Theta}^{(t+1)}, \color{red}{\Gamma}^{(t+1)}).\]
    • Chain of \(Q(\color{blue}{\Theta}, \color{red}{\Gamma})\) in EM-Algorithm: \[\begin{align*}&\mathcal{L}_{\text{obs}}(\color{blue}{\Theta}^{(0)})\overset{\color{red}{(3)}}{=}\color{blue}{Q}(\color{blue}{\Theta}^{(0)}, \color{red}{\Gamma}^{(1)})\overset{\color{red}{(M)}}{\leq} \color{blue}{Q}(\color{blue}{\Theta}^{(1)}, \color{red}{\Gamma}^{(1)})\\ \overset{\color{red}{(E)}}{\leq}&\mathcal{L}_{\text{obs}}(\color{blue}{\Theta}^{(1)})\overset{\color{red}{(3)}}{=}\color{blue}{Q}(\color{blue}{\Theta}^{(1)}, \color{red}{\Gamma}^{(2)})\overset{\color{red}{(M)}}{\leq} \color{blue}{Q}(\color{blue}{\Theta}^{(2)}, \color{red}{\Gamma}^{(2)})\\ \overset{\color{red}{(E)}}{\leq}&\mathcal{L}_{\text{obs}}(\color{blue}{\Theta}^{(2)})\overset{\color{red}{(3)}}{=}\color{blue}{Q}(\color{blue}{\Theta}^{(2)}, \color{red}{\Gamma}^{(3)})\overset{\color{red}{(M)}}{\leq} \color{blue}{Q}(\color{blue}{\Theta}^{(3)}, \color{red}{\Gamma}^{(3)})\\ \vdots&\end{align*}\]
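  • This monotonicity can be observed empirically: a minimal sketch (reusing the e_step/m_step sketch above and the data/parameters initialized in the animation code) tracks \(\mathcal{L}_{\text{obs}}\) across iterations, which should never decrease:
Code
import numpy as np
from scipy.stats import multivariate_normal

def observed_log_likelihood(X, pi, mu, cov):
    f = sum(pi[k] * multivariate_normal(mu[k], cov[k]).pdf(X) for k in range(len(pi)))
    return np.log(f).sum()

# Assumes X, pi, mu, cov already exist (e.g. from the animation code above)
history = []
for _ in range(15):
    gamma = e_step(X, pi, mu, cov)       # E-step
    pi, mu, cov = m_step(X, gamma)       # M-step
    history.append(observed_log_likelihood(X, pi, mu, cov))

print(np.all(np.diff(history) >= -1e-8))   # True: L_obs never decreases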

2. EM Algorithm for GMM

Issues of EM in GMM

  • The EM algorithm only guarantees convergence to a local maximum of the observed data log-likelihood \(\mathcal{L}_{\text{obs}}(\color{blue}{\Theta})\) (which may result in bad clusters).
  • Issues in GMM:
    • Initialization sensitivity: Poor initialization can lead to suboptimal local maxima (people often use K-means to initialize the means).
    • Number of components (K): Choosing too few or too many components can lead to underfitting or overfitting (next part).
    • Singular covariance matrices: If a component collapses onto a single point, its covariance matrix becomes singular, causing numerical instability (people often add a small value to the diagonal of the covariance matrices).
    • Convergence criteria: The choice of convergence threshold can affect the quality of the solution and computational efficiency.
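  • In practice, scikit-learn's GaussianMixture exposes options addressing most of these issues; a minimal sketch (the parameter values below are illustrative, not prescriptions):
Code
from sklearn.mixture import GaussianMixture

gmm = GaussianMixture(
    n_components=2,         # K, to be tuned with AIC/BIC (next part)
    init_params='kmeans',   # initialize the means with K-means
    n_init=10,              # restart 10 times and keep the best local maximum
    reg_covar=1e-6,         # small value added to the covariance diagonals
    tol=1e-3,               # convergence threshold on the lower bound
    max_iter=200,
    random_state=42,
)
gmm.fit(X)
print(gmm.converged_, gmm.n_iter_)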

3. Parameter tuning

  • Unlike hard clustering (KMeans), choosing \(K\) and the other parameters of a GMM is based on probabilistic reasoning.
  • For any \(K\) and candidate parameters \(\color{blue}{\Theta}\), the log-likelihood indicates the goodness of fit of the model to the data: \[\mathcal{L}_{\text{obs}}(\color{blue}{\Theta})=\sum_{i=1}^{n}\ln\left(\sum_{k=1}^K\color{blue}{\pi_k}\mathcal{N}(\text{x}_i;\color{blue}{\vec{\mu_k}}, \color{blue}{\Sigma_k})\right).\]
  • However, simply maximizing \(\mathcal{L}_{\text{obs}}(\color{blue}{\Theta})\) may lead to overfitting (e.g., increasing \(K\) always increases likelihood). Thus, we use model selection criteria that balance fit and complexity:
  • Akaike Information Criterion (AIC): \[\text{AIC} = \underbrace{- 2\mathcal{L}_{\text{obs}}(\color{blue}{\Theta})}_{\color{blue}{\text{Goodness of fit}}} + \underbrace{2\color{red}{p}}_{\color{red}{\text{Complexity}}},\] where \(\color{red}{p}\) is the number of parameters.
  • Bayesian Information Criterion (BIC): \[\text{BIC} = \underbrace{- 2\mathcal{L}_{\text{obs}}(\color{blue}{\Theta})}_{\color{blue}{\text{Goodness of fit}}} + \underbrace{\color{red}{p}\ln(n)}_{\color{red}{\text{Complexity}}},\] where \(n\) is the number of data points.
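  • As a sanity check, both criteria can be computed by hand. A minimal sketch for a full-covariance GMM with \(d\) features and \(K\) components, assuming a fitted sklearn model gmm and the data matrix X from above:
Code
import numpy as np

n, d = X.shape
K = gmm.n_components

# Free parameters of a full-covariance GMM:
# (K - 1) mixing weights + K*d means + K*d*(d+1)/2 covariance entries
p = (K - 1) + K * d + K * d * (d + 1) // 2

log_lik = gmm.score(X) * n               # score() returns the mean log-likelihood per sample
aic = -2 * log_lik + 2 * p
bic = -2 * log_lik + p * np.log(n)

print(np.isclose(aic, gmm.aic(X)), np.isclose(bic, gmm.bic(X)))   # both should be True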

3. Parameter tuning

  • Akaike Information Criterion (AIC): \[\text{AIC} = \underbrace{- 2\mathcal{L}_{\text{obs}}(\color{blue}{\Theta})}_{\color{blue}{\text{Goodness of fit}}} + \underbrace{2\color{red}{p}}_{\color{red}{\text{Complexity}}},\] where \(\color{red}{p}\) is the number of parameters.
  • Bayesian Information Criterion (BIC): \[\text{BIC} = \underbrace{- 2\mathcal{L}_{\text{obs}}(\color{blue}{\Theta})}_{\color{blue}{\text{Goodness of fit}}} + \underbrace{\color{red}{p}\ln(n)}_{\color{red}{\text{Complexity}}},\] where \(n\) is the number of data points.
Code
from sklearn.mixture import GaussianMixture
n_components_range = range(1, 11)
cv_types = ['full', 'tied', 'diag', 'spherical']

results = {ctype: {'aic': [], 'bic': []} for ctype in cv_types}
for ctype in cv_types:
    for n in n_components_range:
        # Fit GMM
        gmm = GaussianMixture(n_components=n, covariance_type=ctype, random_state=42)
        gmm.fit(X)
        
        # Store metrics
        results[ctype]['aic'].append(gmm.aic(X))
        results[ctype]['bic'].append(gmm.bic(X))

import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.express as px

# Create subplots: 2 rows, 1 column
fig = make_subplots(
    rows=2, cols=1,
    vertical_spacing=0.12,
    shared_xaxes=True  # Share x-axis to zoom both together
)

# Define a color map for consistency
colors = px.colors.qualitative.Set1
type_colors = {ctype: colors[i % len(colors)] for i, ctype in enumerate(cv_types)}

for ctype in cv_types:
    color = type_colors[ctype]
    
    # 1. AIC Trace (Solid Line, Circle Marker)
    fig.add_trace(
        go.Scatter(
            x=list(n_components_range),
            y=results[ctype]['aic'],
            mode='lines+markers',
            name=f'Cov: {ctype}',
            line=dict(color=color),
            marker=dict(symbol='circle'),
            legendgroup=ctype  # Group AIC and BIC controls
        ),
        row=1, col=1
    )

    # 2. BIC Trace (Dashed Line, X Marker)
    fig.add_trace(
        go.Scatter(
            x=list(n_components_range),
            y=results[ctype]['bic'],
            mode='lines+markers',
            name=f'Cov: {ctype}',
            line=dict(color=color, dash='dash'),
            marker=dict(symbol='x'),
            legendgroup=ctype,
            showlegend=False  # Hide duplicate legend entries
        ),
        row=2, col=1
    )

# Update Layout
fig.update_layout(
    height=390, width=500,
    hovermode="x unified",  # Show values for all lines at the hover point
    template="ggplot2"
)

# Axis Labels
fig.update_xaxes(title_text="Number of Components (K)", row=2, col=1)
fig.update_yaxes(title_text="AIC", row=1, col=1)
fig.update_yaxes(title_text="BIC", row=2, col=1)

fig.show()
  • Four types of covariance structures are available in sklearn’s GaussianMixture:
    • full: each component has its own \(\color{blue}{\Sigma_k}\).
    • tied: all components share the same \(\color{blue}{\Sigma}\).
    • diag: each component has its own diagonal covariance \(\color{blue}{\Sigma_k} = \text{diag}(\sigma_{k1}^2,\dots,\sigma_{kd}^2)\).
    • spherical: each component has its own single variance (isotropic): \(\color{blue}{\Sigma_k} = \sigma_k^2 I_d\).
  • Interpreting the plots:
    • Elbow/Minimum: Look for the \(K\) at the elbow or with the lowest score.
    • Covariance Type: All covariance types look reasonable here except spherical, and the best choice is \(K=2\).
    • Overfitting: Scores flatten or rise after \(K=2\).
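  • Once \(K\) and the covariance type are chosen, refit and read off the soft assignments, which is the probabilistic clustering promised in the motivation. A minimal sketch assuming the choice \(K=2\) with a full covariance:
Code
from sklearn.mixture import GaussianMixture

best_gmm = GaussianMixture(n_components=2, covariance_type='full', random_state=42).fit(X)

# Soft assignment: responsibility of each component for each observation
resp = best_gmm.predict_proba(X)             # shape (n, K), rows sum to 1
df0['GMM_Group'] = best_gmm.predict(X) + 1   # hard labels, if needed
print(resp[:5].round(3))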

Summary: GMM & EM Algorithm

  1. The Core Concept
  • Probabilistic Clustering: GMM provides soft assignments (probabilities). It assumes data is generated from a mixture of several Gaussian distributions.
  • Flexibility: Can model clusters of different sizes and elliptical shapes (via the Covariance Matrix), not just spheres.
  2. The Engine: Expectation-Maximization (EM)
  • Iterative Process: Alternates between estimating hidden variables and optimizing parameters.
    • E-Step (Expectation): Calculate “Responsibilities” (\(\color{red}{\gamma}_{ik}\)). How likely is it that point \(\text{x}_i\) belongs to cluster \(k\)?
    • M-Step (Maximization): Update parameters \((\color{blue}{\vec{\pi}},\color{blue}{\vec{\mu}},\color{blue}{\vec{\Sigma}})\) using these responsibilities as weights.
  • Convergence: Guaranteed to increase the likelihood (or the lower bound Q) at each step, converging to a local optimum.

Summary: GMM & EM Algorithm

  3. Critical Challenges
  • Local Optima: The algorithm is sensitive to initialization. It may get stuck in a “bad” solution. (Mitigation: Run multiple times or use K-Means initialization).
  • Singularities: If a cluster collapses onto a single point, variance \(\approx 0\) and likelihood \(\to\infty\). (Mitigation: Regularization).
  4. Model Selection
  • Finding K: Do not use Inertia. Use AIC or BIC to balance model fit against complexity. Look for the minimum score.
  • Covariance Type: Tune full, tied, diag, or spherical to prevent overfitting.

🥳 Yeahhhh….









Let’s Party… 🥂